Sentence similarity scoring using neural network distillation

ABSTRACT

The disclosure herein describes a system and method for attentive sentence similarity scoring. A distilled sentence embedding (DSE) language model is trained by decoupling a transformer language model using knowledge distillation. The trained DSE language model calculates sentence embeddings for a plurality of candidate sentences for sentence similarity comparisons. An embedding component associated with the trained DSE language model generates a plurality of candidate sentence representations representing each candidate sentence in the plurality of candidate sentences which are stored for use in analyzing input sentences associated with queries or searches. A representation is created for the selected sentence. This selected sentence representation is used with the plurality of candidate sentence representations to create a similarity score for each candidate sentence-selected sentence pair. A retrieval component identifies a set of similar sentences from the plurality of candidate sentences responsive to the input query based on the set of similarity scores.

BACKGROUND

Language models are frequently used to perform various linguistic tasks such as machine translation, sentiment analysis, question answering and sentence similarity. Many of these language models score sentence-pairs using a cross-attention (CA) operation in which each word in a sentence A attends to all words in a sentence B and vice versa, in addition to each word attending to all other words in the same sentence. In some language models, CA is applied in a cascade throughout a stack of multi-head attention layers to compute similarity between sentence pairs by analyzing the relations between individual words. However, this entails an excessively demanding inference phase which is time-consuming and computationally expensive, and which can create a computational bottleneck. Moreover, CA models are typically not trained to produce sentence embeddings with respect to the task at hand, resulting in additional computational inefficiency.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

Some examples provide a system and method for training a distilled sentence embedding (DSE) language model by decoupling a transformer language model using knowledge distillation to calculate sentence embeddings for a plurality of candidate sentences. A plurality of candidate sentence representations are precomputed by the trained DSE language model and stored in a data storage device. A set of similarity scores is generated for each candidate sentence in the plurality of candidate sentences and a selected sentence associated with an input query based on a representation of the selected sentence and a representation of each candidate sentence in the plurality of candidate sentences. The set of similarity scores includes a similarity score for each candidate sentence compared with the selected sentence. A set of similar sentences is selected from the plurality of candidate sentences based on the set of similarity scores. A response to the input query is generated. The response includes the selected set of similar sentences.

BRIEF DESCRIPTION OF THE DRAWINGS

The disclosed examples are described in detail below with reference to the accompanying drawing figures listed below:

FIG. 1 is an exemplary block diagram illustrating a system 100 for text-based similarity scoring using knowledge distillation from a teacher model to a student model.

FIG. 2 is an exemplary block diagram illustrating a distilled sentence embedding (DSE) model 200 for generating similarity scores.

FIG. 3 is an exemplary block diagram illustrating training of a DSE model via knowledge distillation from a pretrained teacher language model.

FIG. 4 is an exemplary block diagram illustrating a database 400 storing precomputed candidate sentence representations.

FIG. 5 is an exemplary flow chart illustrating operation of the computing device to train a DSE language model by a pretrained teacher language model.

FIG. 6 is an exemplary flow chart illustrating operation of the computing device to perform precomputation of candidate sentence embedding.

FIG. 7 is an exemplary flow chart illustrating operation of the computing device to perform sentence pair similarity computation by a DSE model trained via knowledge distillation from a pretrained transformer language model.

FIG. 8 is an exemplary table illustrating time comparisons between the DSE model and the trained transformer language model performing similarity computation between sentence pairs with a catalog of one thousand candidate sentences.

FIG. 9 is an exemplary table illustrating time comparisons between the DSE model and the trained transformer language model performing similarity determination between a query and a catalog of one hundred thousand candidate sentences.

FIG. 10 is an exemplary block diagram illustrating an example computing environment suitable for implementing some of the various examples disclosed herein.

Corresponding reference characters indicate corresponding parts throughout the drawings.

DETAILED DESCRIPTION

The emergence of self-attentive models, such as the transformer language model, generative pre-training (GPT) models, bidirectional encoder representations from transformers (BERT) model, XLNet and other natural language processing (NLP) models significantly advanced the state-of-the-art in various linguistic tasks such as machine translation, sentiment analysis, question answering and sentence similarity. These models are built upon a stack of self-attention layers that enable each word to attend to other words in a sentence.

In the latest models, such as BERT and XLNet, self-attention is applied in a bidirectional manner. This is different from conventional language models, in which each word in a sentence is conditioned solely on its preceding words. In addition, these architectures support sentence-pair input, which endows these models with the ability to infer sentence similarity. However, this capability entails a non-negligible computational cost. Moreover, CA is repeatedly applied in a cascade throughout a stack of multi-head attention layers. This CA entanglement is a double-edged sword. On the one hand, it computes similarity between sentence pairs by analyzing the relations between individual words a∈A and b∈B. On the other hand, it entails an excessively demanding inference phase in terms of time and computational power. Given a CA model T and a set of candidates X that contains N sentences, the task for the language model is to retrieve the topmost similar sentences in X with regard to a new query sentence q. A naïve solution is to compute the similarity between each sentence x∈X and q, which amounts to N applications of T (scoring each sentence-pair (q, x) using T). In other words, the propagation of the entire candidate set X through T is necessary to produce the similarity scores with respect to a single query q.

A second problem with CA models is the fact that they are not trained to produce sentence embeddings with respect to the task at hand. While several types of heuristics can be employed to produce sentence embeddings (e.g., summing the several last hidden token representations, using the CLS hidden token as a sentence representation, etc.), none of them are truly justified. These operations are employed after the training phase is over and are not directly related to the original training objective. This problem is a key differentiator between CA models and other models that inherently support sentence embedding.

Aspects of the disclosure provide scalable attentive sentence similarity scoring using neural network distillation to decouple a transformer language model and generate a trained sentence embedding model. Decoupling refers to decoupling the sentence embedding from the similarity function analysis. Candidate sentence embedding is performed to precompute candidate sentence representations prior to receiving a query. Each sentence is processed independently rather than as a sentence pair to improve processing speed and reduce resource usage.

In other examples, the similarity scoring is performed by a distilled sentence embedding (DSE) model for computing sentence similarity comparisons based on sentence representations. This enables improved sentence similarity comparisons that reduce processing time and processor resource usage.

FIG. 1 is an exemplary block diagram illustrating a system for text-based similarity scoring using knowledge distillation from a teacher model to a student model. The system 100 in some examples includes a pretrained teacher language model 102. The pretrained teacher language model 102 is a general language model or an unsupervised pre-trained language model. The pretrained teacher language model 102 can include, without limitation, a BERT language model, ELMo deep contextualized word representations model, XLNet, ALBERT or any other natural language processing (NLP) transformer-type of language model. The pretrained teacher language model 102 is utilized to train a distilled sentence embedding (DSE) student model 104.

The pretrained teacher language model 102 and/or the DSE student model 104 executes on a computing device or a cloud server. The computing device can include any device executing computer-executable instructions (e.g., as application programs, operating system functionality, or both) to implement the operations and functionality associated with the pretrained teacher language model 102 and/or the DSE student model 104, such as, but not limited to, the computing device 1000 in FIG. 10 below. The computing device can be implemented as a mobile computing device or any other portable device. The computing device can also include less-portable devices such as servers, desktop personal computers, kiosks, or tabletop devices. Additionally, the computing device can represent a group of processing units or other computing devices. In some examples, the computing device has at least one processor and a memory.

The pretrained teacher language model 102 and/or the DSE student model 104 can also be implemented on a cloud server. A cloud server may be a logical server providing services to one or more computing devices or other clients. A cloud server may be hosted and/or delivered via a network for enabling communications with remote computing devices, such as, but not limited to, a local area network (LAN), a subnet, a wide area network (WAN), a wireless (Wi-Fi) network, the Internet or any other type of network. A cloud server can be implemented on one or more physical servers in one or more data centers. In other examples, the cloud server may be associated with a distributed network of servers.

The DSE student model is a deep learning, machine learning language model for learning a sentence embedding via knowledge distillation 106 from CA models. The pretrained teacher language model 102 is a trained CA teacher model. The DSE student model 104 is being trained by the pretrained teacher language model 102 in some examples.

The DSE student model is trained to map sentences to a set of feature vectors 108 in a latent space, in which the application of a low-cost similarity function approximates the similarity score obtained by the pretrained teacher language model 102 for the corresponding sentence pair 110. Specifically, the DSE student model 104 employs a pairwise training procedure in which each pair of sentences (A, B) and the pretrained teacher language model 102 similarity score (TAB) 112 (that is obtained by the teacher model for the specific sentence-pair) is treated as a training example.

The DSE student model 104 consists of parametric embedding and similarity functions. The embedding function maps the sentence A and the sentence B to one or more vectors, on which the similarity function is applied to produce a student model similarity score (SAB) 114. The embedding function maps a single sentence 116 to a vector 118 to create or precompute a sentence representation 120 for each sentence. The precomputed vectors 108 are stored in a data store, such as, but not limited to, the data storage device 122.

The data storage device 122 can include one or more different types of data storage devices, such as, for example, one or more rotating disk drives, one or more solid state drives (SSDs), and/or any other type of data storage device. The data storage device 122 in some non-limiting examples includes a redundant array of independent disks (RAID) array. In other examples, the data storage device 122 includes a database.

The data storage device 122 in some examples is included within a computing device, such as, but not limited to, the computing device 1000 in FIG. 10 below. In other examples, the data storage device 122 is attached to the computing device, plugged into the computing device or otherwise associated with the computing device. In yet other examples, the data storage device 122 includes a remote data storage accessed by the computing device via a network, such as a remote data storage device, a data storage in a remote data center, or a cloud storage.

The network may be implemented by one or more physical network components, such as, but without limitation, routers, switches, network interface cards (NICs), and other network devices. The network is any type of network for enabling communications with remote computing devices, such as, but not limited to, a local area network (LAN), a subnet, a wide area network (WAN), a wireless (Wi-Fi) network, or any other type of network. In this example, the network is a WAN, such as the Internet. However, in other examples, the network is a local or private LAN.

The DSE student model in some examples applies a loss function to make a comparison 124 between the student model similarity score (SAB) and the teacher model similarity score (TAB). This provides a feedback loop which is used to determine whether training of the DSE student model is sufficiently complete. If the scores generated by the DSE student model vary from the scores generated by the teacher model, the training of the student model continues.

During the training phase, the DSE student model parameters (including those of the embedding and similarity functions) are learned via stochastic gradient descent with respect to a loss function that compares the student output score SAB to the teacher model score TAB.

In the inference phase, the student model maps an input sentence-pair, such as, but not limited to, the sentence pair 110, to a vector-pair using the embedding function. The DSE student model 104 computes the vector-pair similarity score 114 using the similarity function. In this manner, the DSE student model 104 performs a disentanglement that enables the precomputation of the candidate sentence embeddings in advance. As a result, for ranking and retrieval tasks, the computational complexity of a query reduces to a single application of the student DSE model to the query sentence q, followed by N applications of the low-cost similarity function (one for each vector-pair).
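The cost reduction described above can be illustrated with a minimal Python sketch. The `embed` function is a hypothetical stand-in for the trained embedding function (it hashes character trigrams rather than running a DSE model), and the dot product stands in for the low-cost similarity function; both are assumptions made only so the example runs on its own.

```python
import zlib
import numpy as np

DIM = 16  # toy embedding size (assumption; the disclosure describes d = 4096)

def embed(sentence: str) -> np.ndarray:
    """Hypothetical stand-in for the trained embedding function.

    Hashes character trigrams into a unit vector. A real system would run
    the trained DSE model here instead.
    """
    vec = np.zeros(DIM)
    text = sentence.lower()
    for i in range(len(text) - 2):
        bucket = zlib.crc32(text[i:i + 3].encode("utf-8")) % DIM
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

candidates = [
    "How do I reset my password?",
    "Where can I download my invoice?",
    "How do I change my email address?",
]

# Offline: embed every candidate once, before any query arrives.
candidate_vectors = np.stack([embed(c) for c in candidates])

def score_query(query: str) -> np.ndarray:
    """One embedding of the query, then N cheap similarity evaluations."""
    q = embed(query)               # single application of the embedding
    return candidate_vectors @ q   # N dot products (assumed similarity)

scores = score_query("How do I reset my password?")
```

The expensive model runs once per query regardless of catalog size; only the matrix-vector product grows with N.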

In some examples, the DSE student model 104 precomputes a representation for each candidate sentence in the plurality of candidate sentences to be compared with the selected (primary) sentence. If there are ten candidate sentences, the DSE student model 104 performs an embedding function on each of the sentences to generate ten sentence representations. Each of the precomputed candidate sentence representations in a plurality of representations 128 associated with the plurality of candidate sentences is stored in a data store, such as the data storage device 122.

When the DSE student model 104 receives a new input query sentence, the DSE student model converts the sentence to a selected sentence representation for comparison with the representation of each candidate sentence stored in the plurality of representations 128 in a database (table) or other data store on the data storage device 122.

FIG. 2 is an exemplary block diagram illustrating a trained DSE model 200 for generating similarity scores. The DSE model 200 is a trained language model for performing sentence similarity comparisons between a selected primary sentence and two or more candidate sentences. The DSE model 200 is a model such as, but not limited to, the DSE student model 104 after training of the model is completed.

In some examples, the DSE model 200 receives a query 202 including text 204, such as in the form of a sentence 206 input 208 by a user or other entity, such as an application. The query 202 may be referred to as an input query.

The query 202 in some examples is a request including a sentence which is to be compared to one or more candidate sentences. A candidate sentence can include, for example, a pre-generated or pre-prepared response sentence. In some examples, a query is a question or search term. The candidate sentences are sentences responsive to the question or sentences including information matching the search term(s) in the query. In still other examples, the query is a frequently asked question, a search parameter, or other information request.

An embedding component 210 performs an embedding function on the sentence 206 to map the sentence 206 to a set of one or more vectors 212, such as, but not limited to, a feature vector. Embedding refers to creating a representation of the sentence 206. The representation in some non-limiting examples is a numerical value. The trained DSE model in this example is trained to create high-quality sentence embeddings.

A scoring component 214 in some examples performs a comparison function on the selected sentence 206 representation with the representation of each of the candidate sentences in turn to generate a similarity score for each pair of sentences. For example, when an input sentence A is received, if there are four candidate sentences B, C, D and E, the scoring component generates four similarity scores ranking the similarity of the input sentence with each of the four candidate sentences. In this example, the scoring component 214 creates a similarity score 216 for the sentence A compared with the candidate sentence B, a similarity score 218 for the sentence A and the sentence C, a similarity score 220 for the sentence A and the candidate sentence D, and another similarity score for the sentence A compared with the candidate sentence E.

A retrieval component 224 retrieves a set of one or more similar sentences 226 from the plurality of candidate sentences 228. The retrieved sentences are those having the highest similarity scores, that is, the top “K” candidate sentences most similar to the selected sentence 206.
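The top-K selection performed by the retrieval component can be sketched as follows. The function name `top_k` and the example scores are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def top_k(scores, candidates, k):
    """Return the k candidates with the highest similarity scores.

    The result is a list of (candidate, score) tuples, best match first.
    A stable sort keeps the original order for tied scores.
    """
    scores = np.asarray(scores, dtype=float)
    k = min(k, len(candidates))
    order = np.argsort(-scores, kind="stable")[:k]
    return [(candidates[i], float(scores[i])) for i in order]

# Input sentence A scored against candidate sentences B, C, D and E.
best = top_k([0.1, 0.9, 0.5, 0.7], ["B", "C", "D", "E"], k=2)
print(best)  # [('C', 0.9), ('E', 0.7)]
```

For very large catalogs, `np.argpartition` could replace the full sort, at the cost of a second small sort over the K selected items.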

FIG. 3 is an exemplary block diagram illustrating training of a DSE model via knowledge distillation from a pretrained teacher language model. The DSE model is a model for performing sentence similarity comparisons, such as, but not limited to, the DSE student model 104 in FIG. 1 and/or the DSE language model 200 in FIG. 2.

In some examples, $\mathcal{V}$ represents the vocabulary of all supported tokens. The term $Y$ is defined to be the set of all possible sentences that can be generated using the vocabulary $\mathcal{V}$. The teacher model 300, $T: Y \times Y \to \mathbb{R}$, in some examples can be a fine-tuned BERT language model. The teacher language model $T$ receives a sentence-pair $(y,z) \in Y \times Y$ and outputs a similarity score $T_{yz} \triangleq T(y,z)$. In some examples, $T$ is not necessarily a symmetric function.

In some examples, $\psi, \phi: Y \to \mathbb{R}^{d}$ represent sentence embedding functions that embed a sentence $y \in Y$ in a $d$-dimensional latent vector space. The usage of different sentence embedding functions, embedding function $\psi$ 302 and embedding function $\phi$ 304, is due to the fact that $T$ is not necessarily a symmetric function. For example, in a pretrained transformer language model, the sentences A and B are associated with different segment embeddings. Therefore, $\psi$ and $\phi$ play a similar role to the common context and target representations that appear in many neural embedding methods.

Where $f: \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}$ is a (parametric) similarity function, $f$ 306 scores the similarity between the sentence embeddings that are produced by $\psi$ 302 and $\phi$ 304. Then, the student model $S: Y \times Y \to \mathbb{R}$ is defined as

$$S_{yz} \triangleq f(\psi(y), \phi(z)) \tag{1}$$

Given a set of paired training sentences $X = \{(y_i, z_i)\}_{i=1}^{N}$, the student DSE model $S$ is trained such that for all $(y,z) \in X$, its similarity score $S_{yz}$ approximates the teacher language model's score $T_{yz}$ with high accuracy. The student model learns the parameters via a pairwise training procedure, explained below. In some sentence-pair tasks, the pretrained transformer teacher model's codomain is multidimensional. For example, the MNLI task is to predict whether the relation between two sentences is neutral, contradictory or entailment. In this case, the codomain of the teacher model $T$ is $\mathbb{R}^{3}$ and hence the codomain of the similarity function $f$ (and the student model $S$) is $\mathbb{R}^{3}$ as well.

In pairwise training, a loss function $\mathcal{L}: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is defined and the student model $S$ is trained to minimize $\mathcal{L}(S_{yz}, T_{yz})$ in an end-to-end fashion. Specifically, given a sentence-pair $(y,z) \in X$, the embeddings $\psi(y)$ and $\phi(z)$ are computed for the sentences $y$ and $z$, respectively. Then, the similarity score $S_{yz}$ is computed using the similarity function $f$ according to Eq. (1).

The loss $\mathcal{L}$ can be either a regression or classification loss depending on the task at hand. Moreover, $\mathcal{L}$ can be trivially extended to support multiple teacher models. In some examples, two teacher models $T$ and $R$ are used, where $R$ is simply the ground truth labels, as follows:

$$\mathcal{L}_{yz} = \alpha\, l_{dstl}(S_{yz}, T_{yz}) + (1-\alpha)\, l_{lbl}(S_{yz}, R_{yz}) \tag{2}$$

where $\alpha \in [0,1]$ is a hyperparameter that controls the relative amount of supervision that is induced by $T$ and $R$. In this case, the student model is simultaneously supervised by $T$ and $R$. In general, the distillation loss $l_{dstl}$ and the ground truth label loss $l_{lbl}$ are not restricted to be the same loss function.
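The weighting in Eq. (2) can be sketched in a few lines of Python. The function names are illustrative assumptions, and the squared-error loss is used for both terms only to keep the arithmetic easy to follow; Eq. (2) allows the two losses to differ.

```python
def combined_loss(s, t, r, alpha, distill_loss, label_loss):
    """Eq. (2): alpha * l_dstl(S, T) + (1 - alpha) * l_lbl(S, R)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * distill_loss(s, t) + (1.0 - alpha) * label_loss(s, r)

def squared_error(a, b):
    """Toy scalar loss used for both terms in this example."""
    return (a - b) ** 2

# Student score 2.0, teacher score 3.0, ground truth label 5.0.
print(combined_loss(2.0, 3.0, 5.0, 1.0, squared_error, squared_error))  # 1.0
print(combined_loss(2.0, 3.0, 5.0, 0.0, squared_error, squared_error))  # 9.0
print(combined_loss(2.0, 3.0, 5.0, 0.5, squared_error, squared_error))  # 5.0
```

Setting alpha to 1 supervises the student by the teacher alone, and setting it to 0 falls back to ordinary supervised training on the labels.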

The teacher model $T$ 300 in some examples is implemented as a BERT-Large model, consisting of twenty-four encoder layers that each employ a self-attention mechanism. For a sentence-pair input, $T$ employs CA between the two sentences. The teacher model 300 is a teacher language model. The teacher language model is initialized to the pre-trained version and then fine-tuned according to each specific sentence-pair task. In some examples, training and/or fine-tuning is performed using task training data 308 retrieved from a data store, such as, but not limited to, the data storage device 122 in FIG. 1.

After the fine-tuning phase, the score $T_{yz}$ is computed for a sentence-pair $(y,z)$ by propagating a unified representation of the sentence-pair throughout $T$. The score is then extracted from the output layer, which is placed on top of the last hidden representation of the CLS token. In some examples, $T_{yz}$ is set to the logit value before the softmax function or sigmoid activation. The softmax function is a mathematical function which may also be referred to as softargmax or normalized exponential function.

In some examples, the teacher model 300 is based on BERT, which is not symmetric due to its use of different segment embeddings for input sentences. However, the examples are not limited to BERT as a teacher model. For example, the teacher model 300 can also be implemented with XLNet or any other pretrained transformer language model teacher.

In other examples, to refrain from doubling the number of parameters, the system utilizes a symmetric student model by learning a single mutual embedding function $\psi$ ($=\phi$). The embedding function $\psi$ 302 is implemented using a BERT-Large model that operates on a single sentence (using only the segment embedding A) and outputs a vector representation. Specifically, given a sentence $y$, the CLS token is added to the beginning of $y$ and a SEP token to the end, before feeding the embedding function $\psi$ with the resulting representation. Then, the model computes the average pooling operation across the hidden tokens for each of the last four encoder layers' outputs.

Degradations in performance can be observed when including the CLS token in the pooling of hidden tokens; thus, in some examples, the CLS token is excluded from the pooling. This is due to the fact that during pre-training the CLS token representation is used for encoding information across two sentences for the next sentence prediction task and may not be suitable in some examples for representing a single sentence. The pooling operation produces four 1024-dimensional vectors (one for each encoder layer) that are then concatenated to form a 4096-dimensional representation as the final sentence embedding (hence $d=4096$).
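The pooling step can be sketched with NumPy as follows. The input layout (one array of shape `(num_tokens, 1024)` per encoder layer, with the CLS token at position 0) and the concatenation order are assumptions made for illustration; random arrays stand in for real encoder outputs.

```python
import numpy as np

def pool_sentence_embedding(layer_outputs, num_layers=4):
    """Average-pool hidden tokens of the last `num_layers` layers.

    layer_outputs: sequence of arrays, one per encoder layer, each of shape
    (num_tokens, hidden_size), with the CLS token assumed at position 0.
    The CLS token is excluded from the average, and the per-layer means are
    concatenated (earliest of the selected layers first, an assumption).
    """
    pooled = []
    for hidden in layer_outputs[-num_layers:]:
        hidden = np.asarray(hidden)
        pooled.append(hidden[1:].mean(axis=0))  # skip CLS at index 0
    return np.concatenate(pooled)

# Stand-in for a 24-layer BERT-Large forward pass over a 7-token input.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((7, 1024)) for _ in range(24)]
embedding = pool_sentence_embedding(layers)
print(embedding.shape)  # (4096,)
```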

The similarity function

$$f(u,v) = w^{T}\,\mathrm{ReLU}(Wh) \tag{3}$$

can be used, where $h = [u, v, u \circ v, |u-v|] \in \mathbb{R}^{16384}$ ($\circ$ stands for the Hadamard product), $W \in \mathbb{R}^{512 \times 16384}$ and $w \in \mathbb{R}^{512}$. Both $W$ and $w$ are learnable parameters. In these examples, $u, v \in \mathbb{R}^{4096}$ are the sentences' representations that are produced by the embedding function $\psi$. Like the teacher model, $\psi$ is initialized to the pre-trained version of the BERT-Large language model.
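A small NumPy sketch of the similarity function in Eq. (3) follows. The tiny dimensions and hand-picked parameter values are assumptions chosen so the result can be checked by hand; in the described examples the embeddings are 4096-dimensional and the parameters are learned.

```python
import numpy as np

def similarity(u, v, W, w):
    """Eq. (3): f(u, v) = w^T ReLU(W h), h = [u, v, u*v, |u - v|]."""
    h = np.concatenate([u, v, u * v, np.abs(u - v)])  # length 4d
    hidden = np.maximum(W @ h, 0.0)                   # ReLU
    return float(w @ hidden)

# Toy sizes: d = 2, so h has 8 entries; hidden width 2 instead of 512.
u = np.array([1.0, 2.0])
v = np.array([3.0, 1.0])
W = np.ones((2, 8))          # each hidden unit sums the entries of h
w = np.array([1.0, -0.5])

# h = [1, 2, 3, 1, 3, 2, 2, 1], which sums to 15, so W h = [15, 15].
print(similarity(u, v, W, w))  # 7.5
```

Because the function only involves one small matrix product per pair, it is far cheaper than a full cross-attention pass over both sentences.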

In other examples, a loss function includes a linear combination of the distillation and label losses. The distillation loss term is set to the L2 loss as follows:

$$l_{dstl}(S_{yz}, T_{yz}) = l_{L2}(S_{yz}, T_{yz}) = \lVert S_{yz} - T_{yz} \rVert_{2}^{2}$$

For high temperature values, minimizing the cross-entropy loss over the softmax outputs is equivalent to minimizing the L2 loss over the logits before applying the softmax function. Using the L2 loss on the logits can produce superior distillation results. The label loss is set according to the task at hand. For a multiclass classification task, the label loss is set as follows:

$$l_{lbl}(S_{yz}, R_{yz}) = l_{cce}(\rho(S_{yz}), R_{yz})$$

where $R_{yz} \in \{0,1\}^{n}$ is a one-hot vector, $\rho(S_{yz}) \in [0,1]^{n}$ is a discrete probability distribution obtained by applying the softmax function $\rho$, and $l_{cce}(a,b) = -\sum_{i=1}^{n} b_{i} \log a_{i}$ is the categorical cross entropy loss. For a binary classification task, the same loss is used with $n=2$. For a regression task, in some examples, the language model sets $l_{lbl}(S_{yz}, R_{yz}) = l_{L2}(S_{yz}, R_{yz})$, where $R_{yz} \in \mathbb{R}$.
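The two loss terms described above can be sketched with NumPy as follows. The function names are illustrative assumptions, and the clipping constant inside the cross entropy is an added numerical safeguard that is not part of the description.

```python
import numpy as np

def l2_loss(s, t):
    """Squared L2 distance between student and teacher logits."""
    diff = np.asarray(s, dtype=float) - np.asarray(t, dtype=float)
    return float(np.sum(diff ** 2))

def softmax(logits):
    """Softmax, shifted by the maximum logit for stability."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def categorical_cross_entropy(probs, one_hot):
    """l_cce(a, b) = -sum_i b_i * log(a_i); probs are clipped away from 0."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-np.sum(np.asarray(one_hot, dtype=float) * np.log(probs)))

def multiclass_label_loss(student_logits, one_hot):
    """Label loss for classification: l_cce(softmax(S), R)."""
    return categorical_cross_entropy(softmax(student_logits), one_hot)

# Three-class example in the style of MNLI (values are made up).
student = [2.0, 0.5, -1.0]
teacher = [2.5, 0.0, -1.0]
label = [1, 0, 0]
print(l2_loss(student, teacher))  # 0.5
print(multiclass_label_loss(student, label))
```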

FIG. 4 is an exemplary block diagram illustrating a database storing precomputed candidate sentence representations. The database 400 represents any type of database or other data store for storing organized collections of data, such as, but not limited to, a relational database. The database 400 can be implemented on one or more data storage devices, such as, but not limited to, the data storage device 122 in FIG. 1.

In some examples, the database 400 stores a plurality of candidate sentences 402, a plurality of sentence representations 404 and/or a plurality of training similarity scores 406, such as, but not limited to, a similarity score 408 generated by a pretrained teacher transformer model.

The plurality of candidate sentences 402 includes a set of one or more similar sentences 420 which may be output in response 418 to a query, such as, but not limited to, a sentence A 410 and/or the sentence B 412. The plurality of sentence representations can include, for example but without limitation, a representation of sentence A 414 and/or a representation of sentence B 416.
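One way such a store could be realized is sketched below using Python's built-in `sqlite3` module, with each representation serialized as a float32 BLOB. The table name, column names and helper functions are hypothetical; the disclosure does not prescribe a schema.

```python
import sqlite3
import numpy as np

def create_store(conn):
    """Create a (hypothetical) table of sentences and their vectors."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS candidates ("
        "id INTEGER PRIMARY KEY, "
        "sentence TEXT NOT NULL, "
        "representation BLOB NOT NULL)"
    )

def add_candidate(conn, sentence, vector):
    """Store one candidate sentence with its precomputed representation."""
    blob = np.asarray(vector, dtype=np.float32).tobytes()
    conn.execute(
        "INSERT INTO candidates (sentence, representation) VALUES (?, ?)",
        (sentence, blob),
    )

def load_candidates(conn):
    """Return (sentences, matrix) with one row of the matrix per sentence."""
    rows = conn.execute(
        "SELECT sentence, representation FROM candidates ORDER BY id"
    ).fetchall()
    if not rows:
        return [], np.empty((0, 0), dtype=np.float32)
    sentences = [row[0] for row in rows]
    vectors = np.stack([np.frombuffer(row[1], dtype=np.float32) for row in rows])
    return sentences, vectors

conn = sqlite3.connect(":memory:")
create_store(conn)
add_candidate(conn, "sentence A", [1.0, 2.0, 3.0])
add_candidate(conn, "sentence B", [4.0, 5.0, 6.0])
sentences, vectors = load_candidates(conn)
print(sentences, vectors.shape)  # ['sentence A', 'sentence B'] (2, 3)
```

Loading the vectors into a single matrix lets all candidates be scored against a query representation in one vectorized operation.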

FIG. 5 is an exemplary flow chart illustrating operation of the computing device to train a DSE language model by a pretrained teacher language model. The process shown in FIG. 5 is performed by a language model, executing on a computing device, such as the computing device 1000 of FIG. 10.

The process begins by training a student model using knowledge distillation from a pretrained transformer language model at 502. The pretrained transformer language model is a teacher language model pretrained for performing sentence similarity comparisons, such as, but not limited to, the pretrained teacher language model 102 in FIG. 1.

A test sentence is input into the student language model at 504. The test sentence is a single sentence. The student language model generates similarity scores at 506. The similarity scores include one or more scores which rank or score the test sentence paired with each candidate sentence in a plurality of candidate sentences using the representations of the test sentence and the candidate sentences, such as, but not limited to, the similarity score 114 in FIG. 1. The student model similarity scores are compared to the teacher model similarity scores at 508. A determination is made as to whether the scores are similar at 510. If no, training of the student language model continues at 512. The process executes operations 502 through 510 until the student model is trained such that it generates similarity scores which are the same or similar to the scores generated by the teacher model for the same sentence pairs. The process terminates thereafter.
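The train, compare and continue loop of FIG. 5 can be illustrated with a toy NumPy sketch. Everything here is a simplifying assumption: the embeddings are fixed random vectors, the student similarity is a linear function of the Hadamard product u ∘ v, and the "teacher" scores are generated by a hidden weight vector so that the student can match them exactly. A real DSE student would also update its embedding function.

```python
import numpy as np

def train_student(U, V, teacher_scores, lr=0.1, tol=1e-8,
                  max_epochs=200, seed=0):
    """Fit w so that (u * v) @ w matches the teacher score for each pair.

    Uses per-example stochastic gradient descent on the squared error and
    stops once the mean squared gap to the teacher falls below `tol`.
    """
    rng = np.random.default_rng(seed)
    features = U * V                      # Hadamard product per pair
    w = np.zeros(features.shape[1])
    mse = float("inf")
    for epoch in range(1, max_epochs + 1):
        for i in rng.permutation(len(features)):
            s = features[i] @ w           # student score S_yz
            grad = 2.0 * (s - teacher_scores[i]) * features[i]
            w -= lr * grad                # SGD step on (S_yz - T_yz)^2
        # Compare student and teacher scores over all training pairs.
        mse = float(np.mean((features @ w - teacher_scores) ** 2))
        if mse < tol:                     # scores are similar: stop
            break
    return w, epoch, mse

# Toy data: 64 sentence pairs with 4-dimensional embeddings in [-1, 1].
data_rng = np.random.default_rng(1)
U = data_rng.uniform(-1.0, 1.0, size=(64, 4))
V = data_rng.uniform(-1.0, 1.0, size=(64, 4))
w_teacher = np.array([0.5, -1.0, 2.0, 0.25])   # hidden "teacher" weights
T = (U * V) @ w_teacher                         # teacher scores T_yz

w, epochs, mse = train_student(U, V, T)
```

The stopping test plays the role of the determination at 510: training continues while the student's scores still differ from the teacher's by more than the chosen tolerance.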

While the operations illustrated in FIG. 5 are performed by a computing device, aspects of the disclosure contemplate performance of the operations by other entities. In a non-limiting example, a cloud service performs one or more of the operations. In another example, one or more computer-readable storage media storing computer-readable instructions may execute to cause at least one processor to implement the operations illustrated in FIG. 5.

FIG. 6 is an exemplary flow chart illustrating operation of the computing device to perform precomputation of candidate sentence embedding. The process shown in FIG. 6 is performed by a language model, executing on a computing device, such as the computing device 1000 of FIG. 10.

The process begins by decoupling a transformer language model to generate a trained DSE model at 602. The DSE model is a sentence embedding model for mapping sentences to feature vectors to generate representations of sentences, such as, but not limited to, the DSE model 200 in FIG. 2. The DSE model generates a plurality of candidate sentence representations at 604. A representation in the plurality of representations is a vector or other representation of a sentence, such as, but not limited to, the sentence representation 120 in FIG. 1. The representations are stored at 606. The representations can be stored in a database or other data store, such as, but not limited to, the database 400 in FIG. 4. A determination is made as to whether an input sentence is received at 608. The input sentence in some examples is a query entered by a user. If the input sentence is received, the DSE model embeds the sentence at 610. Embedding the sentence refers to converting the input sentence into a representation. The model compares the representation of the input sentence (primary sentence) to the stored representations for each candidate sentence at 612. The DSE model generates similarity scores at 614. Each score in the set of similarity scores indicates the similarity of the input sentence with a given candidate sentence. The process terminates thereafter.

While the operations illustrated in FIG. 6 are performed by a computing device, aspects of the disclosure contemplate performance of the operations by other entities. In a non-limiting example, a cloud service performs one or more of the operations. In another example, one or more computer-readable storage media storing computer-readable instructions may execute to cause at least one processor to implement the operations illustrated in FIG. 6.

FIG. 7 is an exemplary flow chart illustrating operation of the computing device to perform sentence pair similarity computation by a DSE model trained via knowledge distillation from a pretrained transformer language model. The process shown in FIG. 7 is performed by a language model, executing on a computing device, such as the computing device 1000 of FIG. 10.

The process begins by receiving a query at 702. The query includes at least one sentence for similarity comparison with two or more candidate sentences, such as, but not limited to, the query 202 in FIG. 2. The DSE language model embeds the sentence in the query at 704. The DSE language model compares the representation of the input sentence to candidate sentence representations at 706. The DSE language model generates a score at 708. The score indicates the similarity between the query sentence and a selected candidate sentence. A determination is made as to whether there is a next candidate sentence in the plurality of candidate sentences for similarity comparison with the input query sentence at 710. If yes, the DSE model iteratively executes operations 706 through 710, comparing each candidate sentence representation to the representation of the query sentence, until the query sentence has been compared to all the candidate sentences. The DSE language model selects similar sentences based on the scores at 712. The system outputs sentences similar to the query sentence in a query response at 714. The query response is a response such as, but not limited to, the response 418 in FIG. 4. The process terminates thereafter.

FIG. 8 is an exemplary table 800 illustrating time comparisons between the DSE model and the trained transformer language model performing similarity computation between sentence pairs with a catalog of one thousand candidate sentences. The table 800 shows time comparisons between the DSE language model and the pretrained transformer language model for an offline computation of one thousand sentence-pair similarities for a set of one thousand candidate sentences. In this example, the DSE language model generates a similarity score for each combination of a selected (primary) sentence and each of the candidate sentences for a total of one thousand similarity scores. The DSE language model completes the computations in less time and/or with improved efficiency over the pretrained transformer language model, which computes similarity for sentence pairs including the selected sentence with each of the candidate sentences.

FIG. 9 is an exemplary table 900 illustrating time comparisons between the DSE model and the trained transformer language model performing similarity determination between a query and a catalog of one hundred thousand candidate sentences. The table 900 illustrates the computational time differences between the DSE language model and the pretrained transformer language model for an online computation of sentence similarities between a query sentence and a set of one hundred thousand sentences.

The computational time indicates the amount of time used by each model for computing sentence similarities for a selected (primary) sentence with each candidate sentence in a set of one hundred thousand candidate sentences. In this example, each model generates one hundred thousand similarity scores where each score represents the similarity of the selected input sentence with each candidate sentence.
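The source of the time difference can be illustrated with a simple count of model forward passes per query. The following sketch is an assumption-based illustration only: it counts passes rather than measuring time, reproduces no figures from table 800 or table 900, and assumes the candidate representations were precomputed before the query arrived. The function names are hypothetical.

```python
def forward_passes_per_query(num_candidates):
    """Count model forward passes needed to score one query online."""
    return {
        # A cross-attention model scores each (query, candidate) pair jointly,
        # so it runs once per candidate.
        "cross_attention": num_candidates,
        # The DSE model embeds only the query; candidates were embedded offline.
        "dse": 1,
    }


def vector_comparisons_per_query(num_candidates):
    """The DSE model still compares the query vector with every stored vector."""
    return num_candidates


passes = forward_passes_per_query(100_000)
comparisons = vector_comparisons_per_query(100_000)
```

Under these assumptions, both models produce one hundred thousand scores, but the DSE model replaces all but one forward pass with a vector comparison.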

Additional Examples

Some aspects and examples disclosed herein are directed to a system, method and/or computer executable instructions for attentive sentence similarity scoring comprising: a processor; and a computer-readable medium storing instructions that are operative upon execution by the processor to: train a distilled sentence embedding (DSE) language model by decoupling a transformer language model using knowledge distillation to calculate sentence embeddings for a plurality of candidate sentences; precompute a plurality of candidate sentence representations by the trained DSE language model; store the precomputed plurality of candidate sentence representations in a data storage device; generate a set of similarity scores for each candidate sentence in the plurality of candidate sentences and a selected sentence associated with an input query based on a representation of the selected sentence and a representation of each candidate sentence in the plurality of candidate sentences, the set of similarity scores comprising at least one similarity score for each candidate sentence compared with the selected sentence; select a set of similar sentences from the plurality of candidate sentences based on the set of similarity scores; and output, via a user interface device, the set of similar sentences associated with a query response.

Additional aspects and examples disclosed herein are directed to a system, method or computer executable instructions for training a distilled sentence embedding (DSE) language model by decoupling a transformer language model using knowledge distillation to calculate sentence embeddings for a plurality of candidate sentences; precomputing a plurality of candidate sentence representations by the trained DSE language model; storing the precomputed plurality of candidate sentence representations in a data storage device; generating a set of similarity scores for each candidate sentence in the plurality of candidate sentences and a selected sentence associated with an input query based on a representation of the selected sentence and a representation of each candidate sentence in the plurality of candidate sentences, the set of similarity scores comprising at least one similarity score for each candidate sentence compared with the selected sentence; selecting a set of similar sentences from the plurality of candidate sentences based on the set of similarity scores; and outputting, via a user interface, the set of similar sentences associated with a query response.

Additional aspects and examples disclosed herein are directed to a system, method and/or one or more computer storage devices having computer-executable instructions stored thereon for attentive sentence similarity scoring, which, on execution by a computer, cause the computer to perform operations comprising: training a distilled sentence embedding (DSE) language model by decoupling a transformer language model using knowledge distillation to calculate sentence embeddings for a plurality of candidate sentences; precomputing a plurality of candidate sentence representations by the trained DSE language model; storing the precomputed plurality of candidate sentence representations in a data storage device; generating a set of similarity scores for each candidate sentence in the plurality of candidate sentences and a selected sentence associated with an input query based on a representation of the selected sentence and a representation of each candidate sentence in the plurality of candidate sentences, the set of similarity scores comprising at least one similarity score for each candidate sentence compared with the selected sentence; selecting a set of similar sentences from the plurality of candidate sentences based on the set of similarity scores; and outputting the set of similar sentences associated with a query response.

Other examples provide a decoupled sentence similarity model providing fast text-based similarity scoring. The DSE model provides acceleration of the latest transformer-based models for fast sentence similarity computation using a knowledge distillation method to decouple the sentence embedding function from the similarity scoring function of the language model.

The system, in other examples, provides a method of training a neural network for comparing the similarities between a primary sentence and several candidate sentences. The DSE model provides a lightweight function that does not have the processing cost of comparing every word of the primary sentence to each word of the candidate sentences (cross-attention). This is achieved by training a student model from a fully trained cross-attention model. The trained student model can then distill sentences down to candidate vectors that are more quickly compared. Due to pre-computation of the candidate sentence vectors, a newly received query sentence requires only a simple calculation comparing sentence vectors, which permits the model to retrieve results more quickly than the pretrained transformer language models.
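As a non-limiting illustration of the knowledge distillation described above, the following minimal sketch trains a student that embeds each sentence independently so that the dot product of two embeddings approximates a pair-level teacher score. Everything concrete in it is an assumption made for illustration: the teacher is a fixed bilinear function standing in for a cross-attention model, the sentence features are random numbers, the student is a single linear map, and the loss is mean squared error minimized by plain gradient descent. None of these choices is stated in the disclosure.

```python
import random

rng = random.Random(0)
D, K = 3, 2  # assumed feature width and embedding width

# Hypothetical sentence features; a real system would start from token inputs.
sentences = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(12)]
pairs = [(a, b) for a in sentences for b in sentences]

A = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]  # hidden teacher parameters (assumed)


def matvec(m, x):
    return [sum(w * v for w, v in zip(row, x)) for row in m]


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def teacher_score(xa, xb):
    """Stand-in for the pair-level (cross-attention) teacher."""
    return dot(matvec(A, xa), matvec(A, xb))


def student_score(w, xa, xb):
    """The student embeds each sentence on its own, then compares embeddings."""
    return dot(matvec(w, xa), matvec(w, xb))


def loss(w):
    """Mean squared error between student and teacher scores over all pairs."""
    total = sum((student_score(w, a, b) - teacher_score(a, b)) ** 2 for a, b in pairs)
    return total / len(pairs)


def train(w, steps=300, lr=0.02):
    """Full-batch gradient descent on the distillation loss."""
    for _ in range(steps):
        grad = [[0.0] * D for _ in range(K)]
        for xa, xb in pairs:
            ea, eb = matvec(w, xa), matvec(w, xb)
            err = dot(ea, eb) - teacher_score(xa, xb)
            for i in range(K):
                for j in range(D):
                    grad[i][j] += 2 * err * (eb[i] * xa[j] + ea[i] * xb[j])
        w = [[w[i][j] - lr * grad[i][j] / len(pairs) for j in range(D)]
             for i in range(K)]
    return w


w0 = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(K)]
w1 = train(w0)
loss_before, loss_after = loss(w0), loss(w1)
```

Once such a student is trained, `matvec(w1, x)` can be computed once per candidate sentence and stored, so scoring a new query needs only one embedding and a set of dot products.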

In still other examples, the system decouples a transformer language model, making it into a sentence embedding model. Instead of inputting two sentences together into the transformer language model for similarity computation, a single input sentence is provided to the DSE model for extraction and embedding to create a representation for that sentence. The DSE model then compares the representation of the input sentence with pre-generated representations of candidate sentences and calculates a similarity score for each input sentence and candidate sentence pair.

Alternatively, or in addition to the other examples described herein, examples include any combination of the following:

-   a transformer language model, wherein the DSE language model is trained by decoupling the transformer language model to perform sentence similarity comparisons using knowledge distillation;
-   perform a comparison function, by the scoring component, on the selected sentence representation and a representation of a selected candidate sentence to generate a similarity score for the selected candidate sentence and the selected sentence;
-   a user interface device, wherein the response is output to at least one user via the user interface device;
-   a student language model, wherein the student language model is trained by comparing test similarity scores ranking a test sentence paired with each candidate sentence in the plurality of candidate sentences with training similarity scores generated by a trained teacher model to determine whether to continue training the student language model;
-   the trained DSE language model maps the selected sentence to a set of vectors in latent space to generate a representation of the selected sentence for comparison with the plurality of candidate sentence representations; and
-   wherein the trained DSE language model calculates sentence embeddings for a plurality of candidate sentences.
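The student-training example in the list above compares the student's ranking of candidates for a test sentence with the teacher's scores to decide whether training continues. One way this check could look is sketched below. The use of Spearman rank correlation, the function names, and the 0.9 threshold are assumptions for illustration and are not taken from the disclosure; the formula used assumes no tied scores.

```python
def ranks(scores):
    """Rank positions (0 = lowest score); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for position, index in enumerate(order):
        result[index] = position
    return result


def spearman(student_scores, teacher_scores):
    """Spearman rank correlation for two equally long, tie-free score lists."""
    n = len(student_scores)
    rs, rt = ranks(student_scores), ranks(teacher_scores)
    d2 = sum((a - b) ** 2 for a, b in zip(rs, rt))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def should_continue_training(student_scores, teacher_scores, threshold=0.9):
    """Keep training while the student's ranking disagrees with the teacher's."""
    return spearman(student_scores, teacher_scores) < threshold


# Hypothetical scores for one test sentence against four candidates.
teacher = [0.91, 0.40, 0.75, 0.10]
student_good = [0.80, 0.35, 0.60, 0.20]  # same ordering as the teacher
student_bad = [0.10, 0.70, 0.30, 0.90]   # reversed ordering
```

With these hypothetical values, `student_good` ranks the candidates exactly as the teacher does, so training would stop, while `student_bad` reverses the ranking, so training would continue.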

While the aspects of the disclosure have been described in terms of various examples with their associated operations, a person skilled in the art would appreciate that a combination of operations from any number of different examples is also within scope of the aspects of the disclosure.

Example Operating Environment

FIG. 10 is a block diagram of an example computing device 1000 for implementing aspects disclosed herein and is designated generally as computing device 1000. Computing device 1000 is an example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the examples disclosed herein. Neither should computing device 1000 be interpreted as having any dependency or requirement relating to any one or combination of components/modules illustrated. The examples disclosed herein may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program components, being executed by a computer or other machine, such as a personal data assistant or other handheld device.

Generally, program components including routines, programs, objects, components, data structures, and the like, refer to code that performs particular tasks, or implements particular abstract data types. The disclosed examples may be practiced in a variety of system configurations, including personal computers, laptops, smart phones, mobile tablets, hand-held devices, consumer electronics, specialty computing devices, etc. The disclosed examples may also be practiced in distributed computing environments when tasks are performed by remote-processing devices that are linked through a communications network.

Computing device 1000 includes a bus 1010 that directly or indirectly couples the following devices: computer-storage memory 1012, one or more processors 1014, one or more presentation components 1016, I/O ports 1018, I/O components 1020, a power supply 1022, and a network component 1024. While computing device 1000 is depicted as a seemingly single device, multiple computing devices 1000 may work together and share the depicted device resources. For example, memory 1012 may be distributed across multiple devices, and processor(s) 1014 may be housed with different devices.

Bus 1010 represents what may be one or more busses (such as an address bus, data bus, or a combination thereof). Although the various blocks of FIG. 10 are shown with lines for the sake of clarity, delineating various components may be accomplished with alternative representations. For example, a presentation component such as a display device is an I/O component in some examples, and some examples of processors have their own memory. Distinction is not made between such categories as “workstation,” “server,” “laptop,” “hand-held device,” etc., as all are contemplated within the scope of FIG. 10 and the references herein to a “computing device.” Memory 1012 may take the form of the computer-storage media referenced below and operatively provide storage of computer-readable instructions, data structures, program modules and other data for computing device 1000. In some examples, memory 1012 stores one or more of an operating system, a universal application platform, or other program modules and program data. Memory 1012 is thus able to store and access data 1012a and instructions 1012b that are executable by processor 1014 and configured to carry out the various operations disclosed herein.

In some examples, memory 1012 includes computer-storage media in the form of volatile and/or nonvolatile memory, removable or non-removable memory, data disks in virtual environments, or a combination thereof. Memory 1012 may include any quantity of memory associated with or accessible by computing device 1000. Memory 1012 may be internal to computing device 1000 (as shown in FIG. 10), external to computing device 1000 (not shown), or both (not shown). Examples of memory 1012 include, without limitation, random access memory (RAM); read only memory (ROM); electrically erasable programmable read only memory (EEPROM); flash memory or other memory technologies; CD-ROM, digital versatile disks (DVDs) or other optical or holographic media; magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices; memory wired into an analog computing device; or any other medium for encoding desired information and for access by computing device 1000. Additionally, or alternatively, memory 1012 may be distributed across multiple computing devices 1000, for example, in a virtualized environment in which instruction processing is carried out on multiple computing devices 1000. For the purposes of this disclosure, “computer storage media,” “computer-storage memory,” “memory,” and “memory devices” are synonymous terms for computer-storage memory 1012, and none of these terms include carrier waves or propagating signaling.

Processor(s) 1014 may include any quantity of processing units that read data from various entities, such as memory 1012 or I/O components 1020, and may include CPUs and/or GPUs. Specifically, processor(s) 1014 are programmed to execute computer-executable instructions for implementing aspects of the disclosure. The instructions may be performed by the processor, by multiple processors within computing device 1000, or by a processor external to client computing device 1000. In some examples, processor(s) 1014 are programmed to execute instructions such as those illustrated in the accompanying drawings. Moreover, in some examples, processor(s) 1014 represent an implementation of analog techniques to perform the operations described herein. For example, the operations may be performed by an analog client computing device 1000 and/or a digital client computing device 1000. Presentation component(s) 1016 present data indications to a user or other device. Exemplary presentation components include a display device, speaker, printing component, vibrating component, etc. One skilled in the art will understand and appreciate that computer data may be presented in a number of ways, such as visually in a graphical user interface (GUI), audibly through speakers, wirelessly between computing devices 1000, across a wired connection, or in other ways. I/O ports 1018 allow computing device 1000 to be logically coupled to other devices including I/O components 1020, some of which may be built in. Example I/O components 1020 include, for example but without limitation, a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, etc.

Computing device 1000 may operate in a networked environment via network component 1024 using logical connections to one or more remote computers. In some examples, network component 1024 includes a network interface card and/or computer-executable instructions (e.g., a driver) for operating the network interface card. Communication between computing device 1000 and other devices may occur using any protocol or mechanism over any wired or wireless connection. In some examples, network component 1024 is operable to communicate data over public, private, or hybrid (public and private) networks using a transfer protocol, between devices wirelessly using short range communication technologies (e.g., near-field communication (NFC), Bluetooth branded communications, or the like), or a combination thereof. Network component 1024 communicates over wireless communication link 1026 and/or a wired communication link 1026a to a cloud resource 1028 across network 1030. Various different examples of communication links 1026 and 1026a include a wireless connection, a wired connection, and/or a dedicated link, and in some examples, at least a portion is routed through the internet.

Although described in connection with an example computing device 1000, examples of the disclosure are capable of implementation with numerous other general-purpose or special-purpose computing system environments, configurations, or devices. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with aspects of the disclosure include, but are not limited to, smart phones, mobile tablets, mobile computing devices, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, gaming consoles, microprocessor-based systems, set top boxes, programmable consumer electronics, mobile telephones, mobile computing and/or communication devices in wearable or accessory form factors (e.g., watches, glasses, headsets, or earphones), network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, virtual reality (VR) devices, augmented reality (AR) devices, mixed reality (MR) devices, holographic devices, and the like. Such systems or devices may accept input from the user in any way, including from input devices such as a keyboard or pointing device, via gesture input, proximity input (such as by hovering), and/or via voice input.

Examples of the disclosure may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices in software, firmware, hardware, or a combination thereof. The computer-executable instructions may be organized into one or more computer-executable components or modules. Generally, program modules include, but are not limited to, routines, programs, objects, components, and data structures that perform particular tasks or implement particular abstract data types. Aspects of the disclosure may be implemented with any number and organization of such components or modules. For example, aspects of the disclosure are not limited to the specific computer-executable instructions or the specific components or modules illustrated in the figures and described herein. Other examples of the disclosure may include different computer-executable instructions or components having more or less functionality than illustrated and described herein. In examples involving a general-purpose computer, aspects of the disclosure transform the general-purpose computer into a special-purpose computing device when configured to execute the instructions described herein.

By way of example and not limitation, computer readable media comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable memory implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or the like. Computer storage media are tangible and mutually exclusive to communication media. Computer storage media are implemented in hardware and exclude carrier waves and propagated signals. Computer storage media for purposes of this disclosure are not signals per se.

Exemplary computer storage media include hard disks, flash drives, solid-state memory, phase change random-access memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device. In contrast, communication media typically embody computer readable instructions, data structures, program modules, or the like in a modulated data signal such as a carrier wave or other transport mechanism and include any information delivery media.

The order of execution or performance of the operations in examples of the disclosure illustrated and described herein is not essential, and the operations may be performed in different sequential manners in various examples. For example, it is contemplated that executing or performing a particular operation before, contemporaneously with, or after another operation is within the scope of aspects of the disclosure. When introducing elements of aspects of the disclosure or the examples thereof, the articles “a,” “an,” “the,” and “said” are intended to mean that there are one or more of the elements. The terms “comprising,” “including,” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements. The term “exemplary” is intended to mean “an example of.” The phrase “one or more of the following: A, B, and C” means “at least one of A and/or at least one of B and/or at least one of C.”

Having described aspects of the disclosure in detail, it will be apparent that modifications and variations are possible without departing from the scope of aspects of the disclosure as defined in the appended claims. As various changes could be made in the above constructions, products, and methods without departing from the scope of aspects of the disclosure, it is intended that all matter contained in the above description and shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.

What is claimed is:
 1. A system for attentive sentence similarity scoring, the system comprising: a computer-readable medium storing instructions that are operative upon execution by a processor to: precompute, by an embedding component associated with a trained distilled sentence embedding (DSE) language model, a plurality of candidate sentence representations representing each candidate sentence in a plurality of candidate sentences; store the precomputed plurality of candidate sentence representations in a data storage device; generate, by a scoring component, a set of similarity scores for each candidate sentence in the plurality of candidate sentences and a selected sentence associated with an input query based on a representation of the selected sentence and a representation of each candidate sentence in the plurality of candidate sentences, the set of similarity scores comprising at least one similarity score for each candidate sentence compared with the selected sentence; select, by a retrieval component, a set of similar sentences from the plurality of candidate sentences based on the set of similarity scores; and generate a response to the input query comprising the selected set of similar sentences.
 2. The system of claim 1, further comprising: a transformer language model, wherein the DSE language model is trained by decoupling the transformer language model to perform sentence similarity comparisons using knowledge distillation.
 3. The system of claim 1, further comprising: perform a comparison function, by the scoring component, on the selected sentence representation and a representation of a selected candidate sentence to generate a similarity score for the selected candidate sentence and the selected sentence.
 4. The system of claim 1, further comprising: a user interface device, wherein the response is output to at least one user via the user interface device.
 5. The system of claim 1, further comprising: a student language model, wherein the student language model is trained by comparing test similarity scores ranking a test sentence paired with each candidate sentence in the plurality of candidate sentences with training similarity scores generated by a trained teacher model to determine whether to continue training the student language model.
 6. The system of claim 1, further comprising: the trained DSE language model maps the selected sentence to a set of vectors in latent space to generate a representation of the selected sentence for comparison with the plurality of candidate sentence representations.
 7. The system of claim 1, wherein the trained DSE language model calculates sentence embeddings for a plurality of candidate sentences.
 8. A method of attentive sentence similarity scoring, the method comprising: precomputing, by an embedding component associated with a trained distilled sentence embedding (DSE) language model, a plurality of candidate sentence representations representing each candidate sentence in a plurality of candidate sentences; storing the precomputed plurality of candidate sentence representations in a data storage device; generating, by a scoring component, a set of similarity scores for each candidate sentence in the plurality of candidate sentences and a selected sentence associated with an input query based on a representation of the selected sentence and a representation of each candidate sentence in the plurality of candidate sentences, the set of similarity scores comprising at least one similarity score for each candidate sentence compared with the selected sentence; selecting, by a retrieval component, a set of similar sentences from the plurality of candidate sentences based on the set of similarity scores; and generating a response to the input query comprising the selected set of similar sentences.
 9. The method of claim 8, further comprising: training the DSE language model by decoupling a transformer language model to perform sentence similarity comparisons using knowledge distillation.
 10. The method of claim 8, further comprising: performing a comparison function, by the scoring component, on the selected sentence representation and a representation of a selected candidate sentence to generate a similarity score for the selected candidate sentence and the selected sentence.
 11. The method of claim 8, further comprising: outputting the response to at least one user via a user interface device.
 12. The method of claim 8, further comprising: training a student language model using at least one test sentence, wherein similarity scores ranking a test sentence paired with each candidate sentence in the plurality of candidate sentences generated by the student language model are compared with training similarity scores generated by a trained teacher model to determine whether to continue training the student language model.
 13. The method of claim 8, further comprising: mapping, by the trained DSE language model, the selected sentence to a set of vectors in latent space to generate a representation of the selected sentence for comparison with the plurality of candidate sentence representations.
 14. The method of claim 8, wherein the trained DSE language model calculates sentence embeddings for a plurality of candidate sentences.
 15. One or more computer storage devices having computer-executable instructions stored thereon for attentive sentence similarity scoring, which, on execution by a computer, cause the computer to perform operations comprising: precomputing, by an embedding component associated with a trained distilled sentence embedding (DSE) language model, a plurality of candidate sentence representations representing each candidate sentence in a plurality of candidate sentences; storing the precomputed plurality of candidate sentence representations in a data storage device; generating, by a scoring component, a set of similarity scores for each candidate sentence in the plurality of candidate sentences and a selected sentence associated with an input query based on a representation of the selected sentence and a representation of each candidate sentence in the plurality of candidate sentences, the set of similarity scores comprising at least one similarity score for each candidate sentence compared with the selected sentence; selecting, by a retrieval component, a set of similar sentences from the plurality of candidate sentences based on the set of similarity scores; and generating a response to the input query comprising the selected set of similar sentences.
 16. The one or more computer storage devices of claim 15, wherein the operations further comprise: training the DSE language model by decoupling a transformer language model to perform sentence similarity comparisons using knowledge distillation, wherein the trained DSE language model calculates sentence embeddings for a plurality of candidate sentences.
 17. The one or more computer storage devices of claim 15, wherein the operations further comprise: performing a comparison function, by the scoring component, on the selected sentence representation and a representation of a selected candidate sentence to generate a similarity score for the selected candidate sentence and the selected sentence.
 18. The one or more computer storage devices of claim 15, wherein the operations further comprise: outputting the response to at least one user via a user interface device.
 19. The one or more computer storage devices of claim 15, wherein the operations further comprise: training a student language model using at least one test sentence, wherein similarity scores ranking a test sentence paired with each candidate sentence in the plurality of candidate sentences generated by the student language model are compared with training similarity scores generated by a trained teacher model to determine whether to continue training the student language model.
 20. The one or more computer storage devices of claim 15, wherein the operations further comprise: mapping, by the trained DSE language model, the selected sentence to a set of vectors in latent space to generate a representation of the selected sentence for comparison with the plurality of candidate sentence representations.